
PARQUET-2347: Add interface layer between Parquet and Hadoop Configuration #1141

Merged

Conversation

@amousavigourabi (Contributor)

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    Additional parameters to run the read/write tests using the new interface-using methods.

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All public functions and classes in the PR contain Javadoc explaining what they do

japicmp exclusions have been added for the following classes: org.apache.parquet.hadoop.CodecFactory, org.apache.parquet.hadoop.ParquetReader. When these exclusions are removed, the following incompatibilities are detected:

  • org.apache.parquet.hadoop.CodecFactory.configuration: FIELD_TYPE_CHANGED
  • org.apache.parquet.hadoop.ParquetReader$Builder.conf: FIELD_TYPE_CHANGED

This PR is part of an effort that has been discussed on the dev mailing list.

@amousavigourabi force-pushed the add-interface-layer-for-configuration branch from aeb2348 to a2ce8af on September 17, 2023 21:54
Add interface layer for the Hadoop Configuration.
@amousavigourabi force-pushed the add-interface-layer-for-configuration branch from a2ce8af to 7db4478 on September 19, 2023 16:35
@wgtmac (Member) left a comment:

Thanks for doing this! I have left some inline comments.

BTW, I am also curious about the following things.

  • It seems that we still have some compatibility issues. Can you confirm? If yes, could you please write them out explicitly?
  • Is there any follow-up work item to do? Would be good if we can know the whole picture in advance.
  • Is it possible to add a simple test case to prove that a simple writer and reader roundtrip can work successfully without Hadoop dependency?

if (conf instanceof HadoopParquetConfiguration) {
return ((HadoopParquetConfiguration) conf).getConfiguration();
}
Configuration configuration = new Configuration();
Member:

When will it happen?

Contributor Author:

When using a HadoopParquetConfiguration, the user has not yet decoupled from Hadoop, as it is just a wrapper around Configuration. When users want to decouple from Hadoop, they can implement their own ParquetConfiguration that does not rely on Hadoop's Configuration (or a simple implementation can be added later; this PR was already getting rather large). Some code, mainly around the codecs, still needs a Hadoop Configuration. While we are still removing these last references to Hadoop, it is therefore important that we can obtain such an instance from a ParquetConfiguration, so that nothing breaks.
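The unwrap-or-copy pattern the snippet above belongs to can be sketched as follows. This is a simplified, self-contained sketch: ParquetConfiguration and HadoopParquetConfiguration are reduced stand-ins, Configuration here is a plain string map rather than org.apache.hadoop.conf.Configuration, and the helper name toHadoop is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for the PR's types, reduced to what this sketch needs.
interface ParquetConfiguration {
    Iterable<Map.Entry<String, String>> entries();
}

class Configuration {
    final Map<String, String> props = new HashMap<>();

    void set(String name, String value) { props.put(name, value); }

    String get(String name) { return props.get(name); }
}

class HadoopParquetConfiguration implements ParquetConfiguration {
    private final Configuration wrapped;

    HadoopParquetConfiguration(Configuration conf) { this.wrapped = conf; }

    Configuration getConfiguration() { return wrapped; }

    @Override
    public Iterable<Map.Entry<String, String>> entries() {
        return wrapped.props.entrySet();
    }
}

public class ConfigurationUtilSketch {

    // Unwrap-or-copy: a HadoopParquetConfiguration hands back the exact
    // Configuration it wraps; any other implementation is copied entry by
    // entry into a fresh Configuration.
    static Configuration toHadoop(ParquetConfiguration conf) {
        if (conf instanceof HadoopParquetConfiguration) {
            return ((HadoopParquetConfiguration) conf).getConfiguration();
        }
        Configuration configuration = new Configuration();
        for (Map.Entry<String, String> entry : conf.entries()) {
            configuration.set(entry.getKey(), entry.getValue());
        }
        return configuration;
    }

    public static void main(String[] args) {
        Configuration hadoopConf = new Configuration();
        hadoopConf.set("parquet.page.size", "1048576");

        // Unwrapping returns the very same underlying object.
        ParquetConfiguration wrapped = new HadoopParquetConfiguration(hadoopConf);
        if (toHadoop(wrapped) != hadoopConf) throw new AssertionError("expected unwrap");

        System.out.println(toHadoop(wrapped).get("parquet.page.size"));
    }
}
```

The identity check in main is the point of the fast path: while codec code still needs a real Hadoop Configuration, unwrapping avoids copying and keeps any later mutations visible on both sides.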

pom.xml (outdated, resolved)
@amousavigourabi (Contributor Author)

@wgtmac thanks a lot for the review! It's quite a big one, so I really appreciate that you took the time for it.

To address your concerns:

  • It seems that we still have some compatibility issues. Can you confirm? If yes, could you please write them out explicitly?

Correct, japicmp comes up with two incompatibilities. These are in the classes CodecFactory and ParquetReader. The incompatibilities it points to are both changes of private and protected field types from Configuration to ParquetConfiguration. These changes are strictly necessary for the effort to unhadoop the read/write API.

  • Is there any follow-up work item to do? Would be good if we can know the whole picture in advance.

After this, the following steps will need to be taken: 1. creating a simple unhadooped implementation of the ParquetConfiguration interface, and 2. adding a simple way for users to avoid the Hadoop codecs, as the out-of-the-box implementations still rely very heavily on Hadoop classes. These changes should allow users to drop the Hadoop runtime dependency. The Hadoop client API dependency will still be necessary.
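The first follow-up item, a Hadoop-free configuration, could look roughly like this. The class name PlainParquetConfiguration and the reduced four-method interface are hypothetical; the real org.apache.parquet.conf.ParquetConfiguration interface carries more accessors.

```java
import java.util.HashMap;
import java.util.Map;

// Reduced, hypothetical subset of the ParquetConfiguration interface.
interface ParquetConfiguration {
    void set(String name, String value);
    String get(String name);
    int getInt(String name, int defaultValue);
    boolean getBoolean(String name, boolean defaultValue);
}

// Hypothetical Hadoop-free implementation backed by a plain HashMap.
public class PlainParquetConfiguration implements ParquetConfiguration {
    private final Map<String, String> props = new HashMap<>();

    @Override
    public void set(String name, String value) { props.put(name, value); }

    @Override
    public String get(String name) { return props.get(name); }

    @Override
    public int getInt(String name, int defaultValue) {
        String value = props.get(name);
        return value == null ? defaultValue : Integer.parseInt(value);
    }

    @Override
    public boolean getBoolean(String name, boolean defaultValue) {
        String value = props.get(name);
        return value == null ? defaultValue : Boolean.parseBoolean(value);
    }

    public static void main(String[] args) {
        ParquetConfiguration conf = new PlainParquetConfiguration();
        conf.set("parquet.page.size", "1048576");
        System.out.println(conf.getInt("parquet.page.size", 0));          // prints 1048576
        System.out.println(conf.getBoolean("parquet.validation", false)); // prints false
    }
}
```

Because the interface mirrors Hadoop's Configuration accessor names, existing call sites written against ParquetConfiguration would not notice whether a Hadoop-backed or a map-backed implementation is plugged in.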

  • Is it possible to add a simple test case to prove that a simple writer and reader roundtrip can work successfully without Hadoop dependency?

We do not yet have any serious ways for users to use the API without Hadoop dependencies. The parameters added to the TestReadWrite fixture verify that the read/write API still functions when using the ParquetConfiguration interface.

@wgtmac (Member) left a comment:

Thanks for the detailed explanation! This LGTM. You could probably also create JIRA issues in advance for all planned work items.

I would like to hear feedback from more maintainers before proceeding. @gszadovszky @shangxinli @ggershinsky @Fokko

@@ -547,6 +547,8 @@
</excludeModules>
<excludes>
<exclude>${shade.prefix}</exclude>
<exclude>org.apache.parquet.hadoop.CodecFactory</exclude> <!-- change field type from Configuration to ParquetConfiguration -->
Member:

Just curious: can we remove this in a future PR, given that only this PR introduces the breaking change here?

Contributor Author:

I'm not that well versed in how japicmp works exactly, but I reckon this can be removed after the next minor release.

@wgtmac (Member)

wgtmac commented Oct 12, 2023

@shangxinli @gszadovszky Do you have any comment? If not, I plan to merge this next week. Thanks!

@amousavigourabi Could you please resolve the conflicts? Thank you!

@gszadovszky (Contributor)

@wgtmac, @amousavigourabi, sorry, but I won't have the time to properly review this PR in the upcoming days.
The concept looks good to me so I am giving a +0 for this patch.

*
* @deprecated override {@link ReadSupport#init(InitContext)} instead
*/
@Deprecated
Contributor:

Do we still need to add a deprecated method?

Contributor Author:

This PR is focused on transitioning from Configuration to the ParquetConfiguration interface. That included some calls to deprecated methods that I could not quickly transition away from; I would consider removing them out of scope for this PR.

Map<String, String> keyValueMetaData,
MessageType fileSchema,
ReadContext readContext) {
throw new UnsupportedOperationException("Override prepareForRead(ParquetConfiguration, Map<String, String>, MessageType, ReadContext)");
Contributor:

The error message should be more meaningful.

Contributor Author:

I followed the example set by ReadSupport#init(Configuration, Map, MessageType). As this error will not occur unless you are implementing your own ReadSupport class, I am not sure there needs to be much more information in the exception. I'll add a reference to the ReadSupport class, though.
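The pattern under discussion, a default implementation that throws until an implementor overrides it, can be sketched with simplified stand-in types (the real prepareForRead also takes the key-value metadata, file schema, and ReadContext, and ReadSupportSketch is a hypothetical name):

```java
// Simplified stand-in for the ReadSupport base class pattern.
interface ParquetConfiguration {
    String get(String name);
}

abstract class ReadSupportSketch<T> {
    // The default throws, mirroring ReadSupport#init(Configuration, Map,
    // MessageType); the message names the overload to override.
    public T prepareForRead(ParquetConfiguration configuration) {
        throw new UnsupportedOperationException(
            "Override ReadSupport.prepareForRead(ParquetConfiguration, "
                + "Map<String, String>, MessageType, ReadContext)");
    }
}

public class ReadSupportDemo {
    public static void main(String[] args) {
        // Without an override, the default signals the missing implementation.
        try {
            new ReadSupportSketch<String>() {}.prepareForRead(name -> null);
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());
        }

        // With an override, the new-style entry point is used directly.
        ReadSupportSketch<String> overridden = new ReadSupportSketch<String>() {
            @Override
            public String prepareForRead(ParquetConfiguration configuration) {
                return configuration.get("parquet.read.schema");
            }
        };
        System.out.println(overridden.prepareForRead(name -> "message Example {}"));
    }
}
```

Since the exception is only reachable from custom ReadSupport subclasses, naming the class and overload in the message is arguably enough context for the implementor.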

* @return the information needed to write the file
*/
public WriteContext init(ParquetConfiguration configuration) {
throw new UnsupportedOperationException("Override init(ParquetConfiguration)");
Contributor:

Same here

@ConeyLiu (Contributor)

@amousavigourabi we have added so many public methods to keep backward compatibility. So which one is preferred (I think it should be the ParquetConfiguration)? Should we deprecate the old ones?

@amousavigourabi (Contributor Author)

@amousavigourabi we have added so many public methods to keep backward compatibility. So which one is preferred (I think it should be the ParquetConfiguration)? Should we deprecate the old ones?

The ParquetConfiguration methods are indeed preferred, and I would be in favour of deprecating the Hadoop Configuration methods. However, deprecating them might warrant a separate discussion on the mailing list, as their deprecation and subsequent removal would go further than what we have previously discussed: it would be a step toward eliminating the need for hadoop-client-api altogether, instead of just hadoop-client-runtime. It does not necessarily need to be part of this PR.

@wgtmac (Member)

wgtmac commented Oct 14, 2023

Thank you for the review @ConeyLiu. This PR is one of the efforts to make the parquet library usable without the Hadoop dependency. At the moment, we do not want to deprecate the Hadoop dependency yet.

@ConeyLiu (Contributor) left a comment:

Thanks, @amousavigourabi for the response and your efforts. LGTM.

This PR is one of the efforts to make the parquet library usable without the Hadoop dependency. At the moment, we do not want to deprecate the Hadoop dependency yet.

@wgtmac This is a good start. It would be nice to have a best-practices guide to help users understand the different configuration usages.

@wgtmac (Member)

wgtmac commented Oct 15, 2023

I totally agree with you, @ConeyLiu. We might need to consolidate the different configuration docs, like https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md, into a better space.

* @param name the property to set
* @param value the value to set the property to
*/
void set(String name, String value);
Contributor:

setString()?

Contributor Author:

I just went with the way Hadoop's Configuration does it, for this and the other methods in the interface, in order not to have to change references to conf.set() everywhere.

@wgtmac (Member)

wgtmac commented Oct 26, 2023

I will merge this if there is no more comment by the end of this week. Thanks @amousavigourabi for working on this!

@wgtmac merged commit 1b08469 into apache:master on Oct 30, 2023
9 checks passed
@ConeyLiu (Contributor)

This broke the CI checking.

private static final Map<Class<?>, Constructor<?>> CONSTRUCTOR_CACHE = new ConcurrentHashMap<Class<?>, Constructor<?>>();

@SuppressWarnings("unchecked")
public static <T> T newInstance(Class<T> theClass) {
Contributor:

Not sure why this breaks the API checking. However, this class seems redundant, since it has no usages.

Member:

Hmm, this is strange. Should we exclude this class in the japicmp config? @amousavigourabi

Contributor Author:

As there is no current usage, we can get away with deleting this class for now. My main concern with the CI failure is the incompatibility japicmp complains about at org.apache.parquet.conf.ParquetConfiguration.getClass(java.lang.String,java.lang.Class,java.lang.Class): CLASS_GENERIC_TEMPLATE_CHANGED. Especially as this did not come up before and the interface is completely new, I have no idea why this is happening. Neither of the matters japicmp flags should be an issue AFAIK, so that might be something to investigate in our configuration?

Contributor:

I think this must be a bug in japicmp, because both of the methods are generic template methods. It should be safe to suppress the checking.

Contributor Author:

I'll open a new PR shortly, deleting the unused stuff and suppressing the warnings.
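Such a suppression could reuse the japicmp excludes block already present in the project's pom (a sketch; the exact scope of the excludes and whether to exclude whole classes or single members is a judgment call):

```xml
<excludes>
  <!-- japicmp 0.18.x false positive: brand-new generic methods are
       flagged as CLASS_GENERIC_TEMPLATE_CHANGED -->
  <exclude>org.apache.parquet.conf.ParquetConfiguration</exclude>
  <exclude>org.apache.parquet.hadoop.util.SerializationUtil</exclude>
</excludes>
```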

Contributor Author:

It has also started failing on other completely new methods in the Parquet Hadoop module:

  • org.apache.parquet.conf.HadoopParquetConfiguration.getClass(java.lang.String,java.lang.Class,java.lang.Class): CLASS_GENERIC_TEMPLATE_CHANGED
  • org.apache.parquet.hadoop.util.SerializationUtil.readObjectFromConfAsBase64(java.lang.String,org.apache.parquet.conf.ParquetConfiguration): CLASS_GENERIC_TEMPLATE_CHANGED

Might be something to report to the japicmp maintainer?

Contributor:

I tested with 0.16.0 and there was no problem. A little strange.

Member:

Probably the current 0.18.1 is more aggressive than 0.16.0.

@amousavigourabi deleted the add-interface-layer-for-configuration branch on November 23, 2023 14:05
@@ -48,7 +51,7 @@ public class CodecFactory implements CompressionCodecFactory {
private final Map<CompressionCodecName, BytesCompressor> compressors = new HashMap<>();
private final Map<CompressionCodecName, BytesDecompressor> decompressors = new HashMap<>();

protected final Configuration configuration;
Contributor:

This causes the Iceberg build to break when bumping to 1.14.0-SNAPSHOT:

> Task :iceberg-parquet:compileJava FAILED
/Users/fokkodriesprong/Desktop/iceberg/parquet/src/main/java/org/apache/iceberg/parquet/ParquetCodecFactory.java:60: error: cannot find symbol
        codecClass = configuration.getClassLoader().loadClass(codecClassName);
                                  ^
  symbol:   method getClassLoader()
  location: variable configuration of type ParquetConfiguration
/Users/fokkodriesprong/Desktop/iceberg/parquet/src/main/java/org/apache/iceberg/parquet/ParquetCodecFactory.java:62: error: incompatible types: ParquetConfiguration cannot be converted to Configuration
      codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, configuration);

Member:

Does it make sense to fix this on the Iceberg side?

Member:

Anyway, this is a breaking change. We have to fix it before the release.

Contributor:

For Iceberg this is not an issue, since we would remove this class with the 1.14 release (that contains #1134):
https://github.com/apache/iceberg/blob/866021d7d34f274349ce7de1f29d113395e7f28c/parquet/src/main/java/org/apache/iceberg/parquet/ParquetCodecFactory.java#L28-L32

In Iceberg we shade Parquet, so this will not interfere with other libraries that use a different version of Parquet. I'm curious whether we also have this issue on the Spark side (cc @vinooganesh).
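For downstream code that cannot simply delete the affected class, one possible adaptation is to unwrap the Hadoop-backed configuration, mirroring the instanceof pattern used inside this PR. All types below are simplified stand-ins, and the helper name classLoaderOf is hypothetical:

```java
// Simplified stand-ins for the two configuration types involved.
class Configuration {
    ClassLoader getClassLoader() { return getClass().getClassLoader(); }
}

interface ParquetConfiguration {}

class HadoopParquetConfiguration implements ParquetConfiguration {
    private final Configuration wrapped;

    HadoopParquetConfiguration(Configuration conf) { this.wrapped = conf; }

    Configuration getConfiguration() { return wrapped; }
}

public class DownstreamAdaptSketch {

    // A downstream codec factory that used to read the protected
    // Configuration field can unwrap the Hadoop-backed implementation
    // instead of assuming the old field type.
    static ClassLoader classLoaderOf(ParquetConfiguration conf) {
        if (conf instanceof HadoopParquetConfiguration) {
            return ((HadoopParquetConfiguration) conf).getConfiguration().getClassLoader();
        }
        // Hadoop-free implementations fall back to the context class loader.
        return Thread.currentThread().getContextClassLoader();
    }

    public static void main(String[] args) {
        ParquetConfiguration conf = new HadoopParquetConfiguration(new Configuration());
        System.out.println(classLoaderOf(conf) != null); // prints true
    }
}
```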
